Choosing a Distance Metric for Automatic Word Categorization
Authors
Emin Erkan Korkmaz, Göktürk Üçoluk
Department of Computer Engineering, Middle East Technical University, Ankara, Turkey
Abstract
This paper analyzes the functionality of different distance metrics that can be used in a bottom-up unsupervised algorithm for automatic word categorization. The proposed method uses a modified greedy-type algorithm. The formulations of fuzzy theory are also used to calculate the degree of membership of the elements in the linguistic clusters formed. The unigram and bigram statistics of a corpus of about two million words are used. Empirical comparisons are made to support the discussion of which type of distance metric is most suitable for measuring the similarity between linguistic elements.
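To make the idea of comparing distance metrics over bigram statistics concrete, here is a minimal sketch in Python. It is not the authors' algorithm: the function names (`bigram_profiles`, `manhattan`, `euclidean`), the toy corpus, and the use of raw, unnormalized neighbour counts are all assumptions made for illustration. It only shows how two words can be represented by their left and right neighbour counts and compared under two different metrics.

```python
from collections import Counter
import math


def bigram_profiles(tokens):
    """Map each word to counts of its left ("L") and right ("R") neighbours.

    Hypothetical helper: raw bigram counts are an illustrative assumption,
    not the weighting used in the paper.
    """
    profiles = {}
    for left, right in zip(tokens, tokens[1:]):
        profiles.setdefault(left, Counter())[("R", right)] += 1
        profiles.setdefault(right, Counter())[("L", left)] += 1
    return profiles


def manhattan(p, q):
    """Sum of absolute count differences over all observed contexts."""
    return sum(abs(p[k] - q[k]) for k in set(p) | set(q))


def euclidean(p, q):
    """Square root of the summed squared count differences."""
    return math.sqrt(sum((p[k] - q[k]) ** 2 for k in set(p) | set(q)))


# Toy corpus (an assumption; the paper uses about two million words).
tokens = "the cat sat on the mat the dog sat on the rug".split()
profiles = bigram_profiles(tokens)

# "cat" and "dog" occur in identical contexts; "cat" and "mat" do not.
print(manhattan(profiles["cat"], profiles["dog"]))
print(manhattan(profiles["cat"], profiles["mat"]))
print(euclidean(profiles["cat"], profiles["mat"]))
```

In a bottom-up clustering setting, a greedy procedure would repeatedly merge the closest items under whichever metric is chosen, so the choice of metric directly changes which words end up grouped together.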
Related resources
A Method for Improving Automatic Word Categorization
A Method for Improving Automatic Word Categorization. Korkmaz, Emin Erkan. M.S., Department of Computer Engineering. Supervisor: Asst. Prof. Dr. Göktürk Üçoluk. September 1997, 57 pages. This thesis presents a new approach to automatic word categorization which improves both the efficiency of the algorithm and the quality of the formed clusters. The unigram and the bigram statistics ...
Semi-Supervised Composite Kernel Learning Using Distance Metric Learning Techniques
The distance metric plays a key role in many machine learning and computer vision algorithms, so choosing an appropriate distance metric has a direct effect on the performance of such algorithms. Recently, distance metric learning using labeled data or other available supervisory information has become a very active research area in machine learning applications. Studies in this area have shown t...
Emin Erkan Korkmaz and Göktürk Üçoluk (1997). A Method for Improving Automatic Word Categorization
This paper presents a new approach to automatic word categorization which improves both the efficiency of the algorithm and the quality of the formed clusters. The unigram and the bigram statistics of a corpus of about two million words are used with an efficient distance function to measure the similarities of words, and a greedy algorithm to put the words into clusters. The notions of fuzzy clust...
One Size Fits All? A Simple Technique to Perform Several NLP Tasks
Word fragments or n-grams have been widely used to perform different Natural Language Processing tasks such as information retrieval [1] [2], document categorization [3], automatic summarization [4] or, even, genetic classification of languages [5]. All these techniques share some common aspects such as: (1) documents are mapped to a vector space where n-grams are used as coordinates and their ...
Method for Improving Automatic Word Categorization
This paper presents a new approach to automatic word categorization which improves both the efficiency of the algorithm and the quality of the formed clusters. The unigram and the bigram statistics of a corpus of about two million words are used with an efficient distance function to measure the similarities of words, and a greedy algorithm to put the words into clusters. The notions of fuzzy c...